This paper presents the MITRA AI Chatbot Assistant, a locally hosted AI chatbot designed to run on consumer-grade GPUs using 8-bit quantized Large Language Models (LLMs). The system can also connect to proprietary cloud-based LLMs such as ChatGPT, Gemini, Grok, and DeepSeek through their APIs, but for privacy reasons it relies primarily on local models such as Mistral 7B and Llama 3, which also remove the need for constant internet connectivity. MITRA enables private, low-latency inference using quantization techniques such as GPTQ and GGUF. Quantization reduces the model footprint to 7–8 GB, enabling deployment on consumer-grade hardware without depending on expensive commercial infrastructure. MITRA can assist across multiple domains, including education, medical guidance, therapeutic support, and coding.
Introduction
The paper presents MITRA, a locally-deployed AI assistant designed to overcome limitations of cloud-based LLMs such as privacy risks, high costs, latency, and lack of user control. By leveraging quantization techniques (GPTQ, GGUF) to reduce models like Mistral 7B from 14–15 GB to 7–8 GB of VRAM, MITRA enables high-performance inference on consumer-grade hardware (e.g., NVIDIA RTX 3050).
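The memory reduction quoted above follows from simple arithmetic: each weight stored at 16 bits costs 2 bytes, and 8-bit quantization halves that. A minimal sketch, assuming an approximate 7.24-billion-parameter count for Mistral 7B (an illustrative figure, and ignoring KV cache and activation memory):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores KV cache and activations, so real VRAM use is somewhat higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

n = 7.24e9  # approximate Mistral 7B parameter count (assumed for illustration)
print(f"fp16 : {model_size_gb(n, 16):.1f} GB")  # roughly the 14-15 GB cited
print(f"8-bit: {model_size_gb(n, 8):.1f} GB")   # roughly the 7-8 GB cited
```

The halving is exact for the weights themselves; the remaining gap to measured VRAM comes from runtime buffers, which is why deployed usage lands in the 7–8 GB range rather than exactly half of fp16.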
MITRA integrates quantized LLMs with a lightweight inference framework (llama.cpp), a user-friendly interface (React, with voice input), and a REST API (FastAPI) for seamless operation. It supports offline use, multiple models, and diverse applications in education, coding, medical guidance, and creative tasks. Comparative analysis shows MITRA excels in privacy, cost-efficiency, offline capability, and user control, while maintaining competitive performance despite its smaller model size.
Testing results indicate near real-time interaction (~30 tokens/sec), ~95% performance retention after quantization, and stable VRAM usage (~7–8 GB). Limitations include lower performance for highly complex tasks, dependency on GPUs, and minor quality loss from 8-bit quantization.
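The ~30 tokens/sec figure can be reproduced by timing a streaming generation loop and dividing token count by elapsed wall time. The sketch below uses a dummy token stream as a stand-in (an assumption) for the real llama.cpp streaming call:

```python
import time
from typing import Iterable, Iterator

def tokens_per_second(stream: Iterable[str]) -> tuple[int, float]:
    """Consume a token stream; return (token_count, tokens per second)."""
    start = time.perf_counter()
    count = sum(1 for _ in stream)  # drain the stream, counting tokens
    elapsed = time.perf_counter() - start
    return count, count / elapsed

def generate_tokens(n: int = 60) -> Iterator[str]:
    """Dummy stand-in for a streaming model call."""
    for _ in range(n):
        time.sleep(0.001)  # simulate per-token decode latency
        yield "tok"

n, rate = tokens_per_second(generate_tokens())
print(f"{n} tokens at {rate:.1f} tok/s")
```

Wall-clock timing around the full decode loop, rather than per-token timestamps, matches how throughput figures like the one above are typically reported.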
Future directions include mobile deployment with more aggressive quantization, fine-tuning for domain-specific applications, multimodal support, federated learning, and advanced quantization methods to further optimize efficiency while maintaining accessibility. MITRA demonstrates that sophisticated AI assistance can be democratized for consumer hardware without sacrificing privacy or practical usability.
Conclusion
MITRA demonstrates that a locally hosted, privacy-preserving AI assistant is both feasible and practical on consumer-grade hardware. By leveraging modern quantization techniques, lightweight inference frameworks, and an optimized system architecture, the proposed solution provides a viable alternative to traditional cloud-dependent AI platforms.
The system effectively addresses several critical limitations associated with existing approaches, including high operational costs, concerns related to data privacy, dependence on continuous internet connectivity, and limited user control. While certain trade-offs exist in terms of model size and inference speed, the system maintains a balanced performance that is sufficient for a wide range of applications. As a result, MITRA effectively serves educational, personal, and research-oriented use cases.
Furthermore, as the demand for accessible and privacy-aware AI systems continues to grow, MITRA contributes toward the broader goal of democratizing advanced language technologies. The approach presented in this work highlights the potential of edge AI and localized deployment strategies in expanding the reach of intelligent systems.
This work is expected to encourage further research in efficient model optimization, decentralized AI deployment, and real-world applications of local language models. Future developments in this direction may lead to more robust, scalable, and production-ready solutions that bridge the gap between performance and accessibility.